
feat(srt): support prefill and generate with input_embeds #2082

Closed · XuehaiPan wants to merge 5 commits from the generation-input-embeds branch

Conversation

XuehaiPan (Contributor) commented on Nov 18, 2024

Motivation

Resolves #745

Modifications

As per the commit messages.
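For orientation, here is a rough sketch of the kind of client call this feature enables: embed the prompt yourself, then send the embeddings instead of token ids. The endpoint and field names below are assumptions for illustration, not the exact schema added by these commits:

```python
import requests
import torch

# Stand-in for embeddings produced by a model's embedding layer (or any
# custom soft prompt): shape [seq_len, hidden_size].
input_embeds = torch.randn(5, 4096)

# Tensors are sent as nested Python lists so the request stays
# JSON/Pydantic-serializable (see the io_struct comment further down).
resp = requests.post(
    "http://localhost:30000/generate",  # assumed local sglang server
    json={
        "input_embeds": input_embeds.tolist(),
        "sampling_params": {"max_new_tokens": 32},
    },
)
print(resp.json())
```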

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

XuehaiPan changed the title from "feat(srt/io_struct): support prefill and generate with input_embeds" to "feat(srt): support prefill and generate with input_embeds" on Nov 18, 2024.
XuehaiPan force-pushed the generation-input-embeds branch 2 times, most recently from 857750a to 8058e22, on Nov 18, 2024.
merrymercy (Contributor) commented:

Thanks for the contribution. There is a recent related PR, #2052. Could you take a look at it?

merrymercy mentioned this pull request on Nov 18, 2024.
XuehaiPan force-pushed the generation-input-embeds branch from 976f2c0 to 98331c0 on Nov 19, 2024.
merrymercy (Contributor) left a review:
Thanks for the contribution! I left a few comments.
Can you add a test case for llama and llava?

[Review comment on .pre-commit-config.yaml — outdated, resolved.]
```python
from enum import Enum
from typing import Dict, List, Optional, Union

from sglang.srt.managers.schedule_batch import BaseFinishReason
from sglang.srt.sampling.sampling_params import SamplingParams

# Use sequence instead of Tensor here because Pydantic serializes Python objects
```
merrymercy (Contributor) commented:

sequence or list?
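For readers following along: the question is only about the comment's wording, but the underlying point is that the field is declared as nested lists of floats rather than torch.Tensor, because Pydantic can (de)serialize plain Python containers but not tensors. A minimal sketch of that idea (hypothetical model class, Pydantic v2):

```python
from typing import List, Optional

import torch
from pydantic import BaseModel


class GenerateReqInput(BaseModel):
    # Nested lists of floats, shape [seq_len][hidden_size]:
    # JSON-serializable, unlike a torch.Tensor field.
    input_embeds: Optional[List[List[float]]] = None


req = GenerateReqInput(input_embeds=torch.randn(2, 4).tolist())
print(req.model_dump_json())  # round-trips cleanly through JSON
```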

Comment on lines +219 to +226:

```python
if sys.version_info >= (3, 10):
    _: dataclasses.KW_ONLY
```
merrymercy (Contributor) commented:

What is this used for?
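For background (general Python, not specific to this PR): `dataclasses.KW_ONLY` is a Python 3.10+ sentinel type; every field declared after a pseudo-field annotated with it becomes keyword-only in the generated `__init__`, hence the version guard. Field names below are illustrative:

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class Req:
    rid: str
    _: dataclasses.KW_ONLY  # sentinel: every field below is keyword-only
    input_embeds: Optional[list] = None


Req("r0", input_embeds=[[0.1, 0.2]])  # OK
# Req("r0", [[0.1, 0.2]])            # TypeError: too many positional arguments
```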

```diff
@@ -430,6 +435,9 @@ def __repr__(self):
 class ScheduleBatch:
     """Store all information of a batch."""

+    if sys.version_info >= (3, 10):
```
merrymercy (Contributor) commented:

Is it possible to get rid of this?

```diff
@@ -876,7 +902,7 @@ def check_for_jump_forward(self, pad_input_ids_func):
                 jump_forward_reqs.append(req)
                 keep_indices.remove(i)

-        self.filter_batch(keep_indices=list(keep_indices))
+        self.filter_batch(keep_indices=sorted(keep_indices))
```
merrymercy (Contributor) commented:

Why is sorted better here?
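A plausible answer, inferred from the surrounding code rather than stated in the thread: `keep_indices` is a set at this point, and iterating a set yields an arbitrary, implementation-defined order, while `sorted()` guarantees ascending indices, i.e. the original batch order:

```python
keep_indices = {2, 0, 10**6}

# Set iteration order is an implementation detail and may scramble rows.
print(list(keep_indices))    # e.g. [1000000, 0, 2] -- not guaranteed

# sorted() is deterministic and preserves the original batch ordering.
print(sorted(keep_indices))  # [0, 2, 1000000]
```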

```python
(
    logits_output,
    next_token_ids,
    next_token_embeds,
```
merrymercy (Contributor) commented:

Why do you need next_token_embeds? I think after the first prefill, we can use token ids and do not need to take embedding inputs anymore.

```python
(
    logits_output,
    next_token_ids,
    next_token_embeds,
```
merrymercy (Contributor) commented:

I think next_token_embeds is probably not necessary here; it makes things much more complicated. Some of the handling here is also incorrect, since the copies of these tensors need to be handled properly. Ideally, we can get rid of next_token_embeds and leave this file unchanged.
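To illustrate the reviewer's point: once prefill has consumed the user-supplied embeddings, every subsequent decode step can recover embeddings from the sampled token ids through the model's own embedding table, so nothing embedding-shaped needs to flow through the scheduler. A self-contained sketch (sizes and names are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for the model's token embedding table.
embed_tokens = nn.Embedding(num_embeddings=32000, embedding_dim=4096)


def decode_step_inputs(next_token_ids: torch.Tensor) -> torch.Tensor:
    # Carries the same information as a hypothetical next_token_embeds,
    # recomputed from ids alone -- the scheduler only tracks token ids.
    return embed_tokens(next_token_ids)


print(decode_step_inputs(torch.tensor([17, 42])).shape)  # torch.Size([2, 4096])
```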

```diff
@@ -211,6 +218,11 @@ def init_new(
         forward_mode=batch.forward_mode,
         batch_size=len(batch.seq_lens),
         input_ids=batch.input_ids,
+        input_embeds=(
+            batch.input_embeds.clone().detach().to(device)
```
merrymercy (Contributor) commented:

Can we get rid of this extra copy?
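One possible answer, assuming the source tensor is not mutated afterwards: `.to(device)` already copies whenever the tensor lives on a different device (and is a no-op otherwise), which makes the `clone()` a second, redundant copy:

```python
import torch

batch_input_embeds = torch.randn(5, 4096)  # stand-in for batch.input_embeds
device = "cuda" if torch.cuda.is_available() else "cpu"

# clone().detach().to(device) can materialize the data twice. If the source
# is not mutated later, detach() creates no copy and to() copies at most once.
input_embeds = batch_input_embeds.detach().to(device, non_blocking=True)
```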

```python
def forward_decode(self, forward_batch: ForwardBatch):
    if self.cuda_graph_runner and self.cuda_graph_runner.can_run(forward_batch):
        return self.cuda_graph_runner.replay(forward_batch)

    forward_batch.positions = (forward_batch.seq_lens - 1).to(torch.int64)
    self.attn_backend.init_forward_metadata(forward_batch)

    if forward_batch.input_embeds is not None:
```
merrymercy (Contributor) commented:

We probably only need this input_embeds for prefill.

merrymercy (Contributor) commented:

I feel #2052 is probably a cleaner solution.

XuehaiPan marked this pull request as draft on Nov 22, 2024.
XuehaiPan force-pushed the generation-input-embeds branch from 98331c0 to 62e3104 on Nov 22, 2024.
XuehaiPan force-pushed the generation-input-embeds branch from 62e3104 to 4e28940 on Nov 23, 2024.
merrymercy (Contributor) commented:

Will close this one in favor of #2052.

merrymercy closed this on Nov 25, 2024.
Linked issue: #745 — [Feature] Generation Inputs: input_embeds